May 8, 2019
\[ \gamma = \frac{N - ||\mathcal{I}||}{N} = 1 - \frac{||\mathcal{I}||}{N}\] \[ N = \textrm{ Number of Patients }\] \[\mathcal{I} = \textrm{ Set of strain identifiers }\] \[||\mathcal{I}|| = \textrm{ Actual Number of Clusters/Strains }\]
\[\alpha = \textrm{ proportion of the population sampled }\] \[ n_i = \textrm{ actual size of the }i \textrm{th cluster}\] \[ m_i = \textrm{ observed size of the }i \textrm{th cluster}\]
Notice that
\[1 \le m_i \le n_i\] and \[\sum n_i = N\] \[ \sum m_i = \alpha N\]
\[ \widehat{HAI}_{naive} = 1 - \frac{||I||}{n} \] \[ n = \textrm{ sample size } \]
\[m_i | n_i \sim \textrm{ZTHyperGeometric}(n_i, \; N-n_i, \;\alpha N) \; \textrm{for}\; i \in I\]
\[f(0|n_i) = \frac{ {n_i \choose 0}{N-n_i \choose \alpha N} }{ {N \choose \alpha N}}\]
Notice that \(\alpha\) and \(f(0|n_i)\) are inversely related, so we can crudely approximate
\[f(0|n_i) \approx 1-\alpha\]
This approximation is exact when \(n_i = 1\) and degrades as \(n_i\) grows.
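A quick numerical sketch of the zero-class probability above (the helper `f_zero` and the parameter values \(N = 1000\), \(\alpha = 0.1\) are illustrative, not from the analysis):

```python
from math import comb

def f_zero(n_i, N, alpha):
    """Hypergeometric probability that none of the n_i members of
    cluster i appear in a simple random sample of alpha*N patients."""
    s = round(alpha * N)  # number of patients sampled
    return comb(N - n_i, s) / comb(N, s)

# The crude approximation f(0|n_i) ~ 1 - alpha is exact for singleton
# clusters and becomes an overestimate as n_i grows:
singleton = f_zero(1, 1000, 0.10)   # equals 1 - alpha = 0.9
larger = f_zero(5, 1000, 0.10)      # noticeably below 0.9
```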
\[E[m_i] = E[ E(m_i|n_i)] = E[ (1-f(0|n_i))^{-1} \;\alpha \;n_i]\]
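The conditional expectation above can be checked by simulation: the mean of a zero-truncated hypergeometric draw should match \((1-f(0|n_i))^{-1}\,\alpha\, n_i\). The parameter values below are illustrative assumptions, not values from the data:

```python
import numpy as np
from math import comb

# Illustrative parameters (assumptions, not from the data)
N, n_i, alpha = 1000, 5, 0.10
s = round(alpha * N)  # sample size

# Theoretical zero-truncated mean: (1 - f(0|n_i))^{-1} * alpha * n_i
f0 = comb(N - n_i, s) / comb(N, s)
theoretical = alpha * n_i / (1 - f0)

# Simulate hypergeometric counts and condition on the cluster being seen
rng = np.random.default_rng(42)
draws = rng.hypergeometric(n_i, N - n_i, s, size=200_000)
simulated = draws[draws > 0].mean()  # discard zeros: truncation
```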
Utilizing the expectation above, we can derive two different estimators.
\[\widehat{n} = \sum_{i \in I}\widehat{n}_i\] \[ \widehat{n}_i = \textrm{ estimated size of the }i\textrm{th observed cluster}\] \[ I = \textrm{ Set of observed strains }\] \[ ||I|| = \textrm{ Observed Number of Clusters/Strains }\] \[\widehat{\gamma}^* = \frac{1}{\widehat{n}}\sum_{i\in I} (\widehat{n}_i-1) = \frac{\widehat{n} - ||I||}{\widehat{n}} = 1-\frac{||I||}{\widehat{n}}\]
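The plug-in estimator \(\widehat{\gamma}^*\) is a one-liner once estimated cluster sizes are in hand (the function name and example sizes below are hypothetical):

```python
def gamma_star(n_hat):
    """Plug-in clustering estimator: 1 - (number of observed strains)
    divided by the estimated total patient count, given a list of
    estimated cluster sizes n_hat_i."""
    return 1 - len(n_hat) / sum(n_hat)

# Three observed strains with estimated sizes 2, 3, 5:
gamma_star([2, 3, 5])  # 1 - 3/10 = 0.7
```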
By repeatedly sub-sampling at rate \(\alpha\), \(J\) times, and calculating \(\widehat{\gamma}^*_j\) for the \(j\)th sub-sample,
\[\bar{\delta} = \frac{1}{J}\sum_{j=1}^{J}\left[ \textrm{logit}(\widehat\gamma^*) - \textrm{logit}(\widehat\gamma^*_j) \right]\]
\[\widehat{\gamma} = \textrm{ilogit}\left( \textrm{logit}( \widehat{\gamma}^* ) + \bar\delta \right)\]
We performed the bias correction step on the logit scale to ensure the resulting estimator is in \([0,1]\).
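The two bias-correction equations above can be sketched as follows (function names are hypothetical; the inputs are the full-data estimate and the \(J\) sub-sample estimates):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def ilogit(x):
    return 1 / (1 + np.exp(-x))

def bias_correct(gamma_star_full, gamma_star_subs):
    """Shift the full-data estimate by the average logit-scale
    discrepancy across the J sub-sample estimates, then map back
    to (0, 1) with the inverse logit."""
    subs = np.asarray(gamma_star_subs, dtype=float)
    delta_bar = np.mean(logit(gamma_star_full) - logit(subs))
    return float(ilogit(logit(gamma_star_full) + delta_bar))
```

Because the shift happens on the logit scale, the corrected estimate cannot escape the unit interval, no matter how large \(\bar\delta\) is.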
An approximate 95% confidence interval for \(\gamma\), guaranteed to lie in \([0,1]\), follows by back-transforming from the logit scale:
\[\textrm{ilogit} \left[ \textrm{logit}(\widehat\gamma) \pm Z_{0.975}\cdot SE(\textrm{logit}(\widehat\gamma))\right]\]
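A minimal sketch of the back-transformed interval, assuming the logit-scale standard error has already been computed (the function name and the example values 0.7 and 0.25 are illustrative):

```python
from math import exp, log

def logit(p):
    return log(p / (1 - p))

def ilogit(x):
    return 1 / (1 + exp(-x))

def logit_ci(gamma_hat, se_logit, z=1.96):
    """95% interval on the logit scale, back-transformed so both
    endpoints stay inside (0, 1). z defaults to Z_{0.975} ~ 1.96."""
    center = logit(gamma_hat)
    return (ilogit(center - z * se_logit),
            ilogit(center + z * se_logit))
```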
The Oxfordshire data could be reasonably modeled using a mixture of two distributions to separate the small cluster sizes from the large. We chose to model the small cluster sizes with a zero-truncated Poisson distribution and the large cluster sizes with a logNormal distribution.
\[n_i \sim \begin{cases} \textrm{TPoisson}(\lambda) & \textrm{ with probability } 1 - \rho \\ \textrm{logNormal}(\mu, \sigma) &\textrm{ with probability } \rho \end{cases}\]
for \(i\) in \(\mathcal{I}\).
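A generative sketch of the mixture above (all parameter values passed in are illustrative assumptions, and rounding the logNormal draw to an integer size of at least 1 is a modeling choice of this sketch, not stated in the notes):

```python
import numpy as np

def sample_cluster_sizes(k, lam, mu, sigma, rho, seed=None):
    """Draw k actual cluster sizes n_i from the two-component mixture:
    a zero-truncated Poisson(lam) for small clusters (prob. 1 - rho)
    and a logNormal(mu, sigma), rounded to an integer of at least 1,
    for large clusters (prob. rho)."""
    rng = np.random.default_rng(seed)
    sizes = np.empty(k, dtype=int)
    for j in range(k):
        if rng.random() < rho:                 # large-cluster component
            sizes[j] = max(1, round(rng.lognormal(mu, sigma)))
        else:                                  # small-cluster component
            x = 0
            while x == 0:                      # reject zeros (truncation)
                x = rng.poisson(lam)
            sizes[j] = x
    return sizes
```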